<<<<<<< HEAD Interrogating sex as a biological variable in preterm birth inequity ======= Regional Analysis of Preterm Birth Characteristics >>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472 ======= @media print { pre > code.sourceCode { white-space: pre-wrap; } pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } pre.numberSource code > span { position: relative; left: -4em; counter-increment: source-line; } pre.numberSource code > span > a:first-child::before { content: counter(source-line); position: relative; left: -1em; text-align: right; vertical-align: baseline; border: none; display: inline-block; -webkit-touch-callout: none; -webkit-user-select: none; -khtml-user-select: none; -moz-user-select: none; -ms-user-select: none; user-select: none; padding: 0 4px; width: 4em; } pre.numberSource { margin-left: 3em; padding-left: 4px; } div.sourceCode { } @media screen { pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; } } >>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472
<<<<<<< HEAD

Interrogating sex as a biological variable in preterm birth inequity

=======

Regional Analysis of Preterm Birth Characteristics

>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472

BMIN503/EPID600 Final Project

Author

Rose Albert


<<<<<<< HEAD

1 Interrogating sex as a biological variable in preterm birth inequity

=======

1 Interrogating sex as a biological variable in county-level preterm birth inequity

>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472

December 13, 2024

By Rose Albert

1.1 Overview

<<<<<<< HEAD

Preterm birth affects approximately 10% of all births, and there is a well-established male vulnerability and female resilience during the neonatal period. The goal of this project is to interrogate sex as a biological variable using county-level data from the National Vital Statistics System queried using CDC WONDER (Wide-Ranging Online Data for Epidemiologic Research). This project will assess preterm-birth rates and NICU admissions. I have spoken with two experts in perinatal health, Dr. Aimin Chen and Dr. Heather Burris, about biological and structural determinants contributing to preterm birth inequity, as well as potential areas of clinical, public health, and policy intervention.

=======

Preterm birth affects approximately 10% of all births, and there is a well-established male vulnerability and female resilience during the neonatal period. The goal of this project is to interrogate sex as a biological variable at the county-level using CDC Wonder Natality Data and identify sex-differences in preterm birth rates and NICU admission. I have spoken with Dr. Aimin Chen and Dr. Heather Burris about biological and structural determinants contributing to preterm birth inequity, as well as potential areas of clinical, public health, and policy intervention.

>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472

The data for this project can be found in my final project GitHub repository.

1.2 Introduction

<<<<<<< HEAD

More than a third of infant deaths in the US are preterm-related (Callaghan et al., 2006). Preterm birth can predispose infants to immediate risks in the neonatal period as well as long-term adverse sequelae. For example, preterm infants are born with underdeveloped lungs and often require therapeutic interventions such as surfactant administration, antibiotics, steroids, mechanical ventilation, and supplemental oxygen for survival (Thebaud et al., 2019). However, these therapeutics can also pose postnatal insults and contribute to lung injury and chronic diseases such as bronchopulmonary dysplasia (Thebaud et al., 2019). Despite advances in neonatal care that have improved overall survival of preterm infants, rates of bronchopulmonary dysplasia have continued to rise (Bell et al., 2022). Bronchopulmonary dysplasia is one of many male-biased complications of prematurity (van Westering-Kroon et al. 2021). In addition to a male disadvantage in neonatal outcomes, there is also a male vulnerability for pregnancy complications and indications for preterm birth (Inkster et al., 2021).

Interrogation of sex as a biological variable has received increasing attention as a critical area for understanding basic pathophysiology and differential response to therapeutics (Albert, Lee, & Lingappan, 2023). The NIH has also established consideration of sex as a biological variable as a research mandate (“NOT-OD-15-102”, 2015). This project will investigate male susceptibility to preterm birth and NICU admission using publicly available county-level data. An understanding of preterm birth characteristics can inform neonatal care toward precision medicine approaches.

1.3 Methods

Data sources

County-level Birth Characteristics Data Source: I am using the CDC Wonder Natality Data (2023) grouped by “County of Residence,” “Sex of Infant,” “NICU Admission,” and “OE Gestational Age Recode 10,” with the measures “Births” and “Percent of Total Births.”

Mapping Data Source: I am drawing from Assignment 5 to use geospatial data, tidy census, and leaflet to generate interactive maps.update.packages(“tidycensus”)

Data curation and cleaning

To work with this dataset, I loaded the necessary libraries and merged the county population data with the birth data by “County.of.Residence.Code”. To clean the data, I am removing redundant columns and renaming my variables. I am also creating a new variable for preterm birth by grouping the gestational ages that are <37 weeks. I then merge this with my county shape files.

=======

Preterm birth can predispose infants to immediate risks in the neonatal period as well as long-term adverse sequelae. For example, preterm infants are born with underdeveloped lungs and often require therapeutic interventions such as surfactant administration, antibiotics, steroids, mechanical ventilation, and supplemental oxygen for survival. However, these therapeutics can also pose postnatal insults and contribute to lung injury and diseases such as bronchopulmonary dysplasia.

Interrogation of sex as a biological variable has received increasing attention as a critical area for understanding basic pathophysiology and differential response to therapeutics. Male sex is an independent risk factor for numerous pediatric diseases and diseases of prematurity. This project will investigate male susceptibility to preterm birth and NICU admission at the county-level. An understanding of preterm birth characteristics can inform local public health interventions.

1.3 Methods

Data curation and cleaning

County-level Birth Characteristics Data Source: I am using the CDC Wonder Natality Data (2023) grouped by “County of Residence,” “Sex of Infant,” “NICU Admission,” and “OE Gestational Age Recode 10,” with the measures “Births” and “Percent of Total Births.”

Mapping Data Source: I am drawing from Assignment 5 to use geospatial data, tidy census, and leaflet to generate interactive maps.update.packages(“tidycensus”)

>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472
#load necessary libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(tigris)
To enable caching of data, set `options(tigris_use_cache = TRUE)`
in your R script or .Rprofile.
library(tidycensus)
library(leaflet)
options(tigris_use_cache = TRUE)
options(progress_enabled = FALSE)

#Load county-level birth data downloaded from CDC Wonder
<<<<<<< HEAD
birth_data <- read.delim("https://raw.githubusercontent.com/rosemalbert/BMIN503_Final_Project/refs/heads/master/Natality%2C%202016-2023%20expanded_sex.txt")

county_data <- read.delim("https://raw.githubusercontent.com/rosemalbert/BMIN503_Final_Project/refs/heads/master/Natality%2C%202016-2023%20expanded%20_county.txt")
=======
birth_data <- read.delim("https://raw.githubusercontent.com/rosemalbert/BMIN503_Final_Project/refs/heads/master/Natality%2C%202016-2023%20expanded_sex.txt")

county_data <- read.delim("https://raw.githubusercontent.com/rosemalbert/BMIN503_Final_Project/refs/heads/master/Natality%2C%202016-2023%20expanded%20_county.txt")
>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472

#view column names
colnames(birth_data)
<<<<<<< HEAD
 [1] "Notes"                             "County.of.Residence"              
 [3] "County.of.Residence.Code"          "Sex.of.Infant"                    
 [5] "Sex.of.Infant.Code"                "NICU.Admission"                   
 [7] "NICU.Admission.Code"               "OE.Gestational.Age.Recode.10"     
 [9] "OE.Gestational.Age.Recode.10.Code" "Births"                           
[11] "X..of.Total.Births"               
colnames(county_data)
[1] "Notes"                    "County.of.Residence"     
[3] "County.of.Residence.Code" "Births"                  
[5] "Total.Population"         "Birth.Rate"              
[7] "X..of.Total.Births"      
#merge birth_data and county_data
merged_data <- left_join(birth_data, county_data, by = "County.of.Residence.Code")
Warning in left_join(birth_data, county_data, by = "County.of.Residence.Code"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 11803 of `x` matches multiple rows in `y`.
ℹ Row 250 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
=======
 [1] "Notes"                             "County.of.Residence"              
 [3] "County.of.Residence.Code"          "Sex.of.Infant"                    
 [5] "Sex.of.Infant.Code"                "NICU.Admission"                   
 [7] "NICU.Admission.Code"               "OE.Gestational.Age.Recode.10"     
 [9] "OE.Gestational.Age.Recode.10.Code" "Births"                           
[11] "X..of.Total.Births"               
colnames(county_data)
[1] "Notes"                    "County.of.Residence"     
[3] "County.of.Residence.Code" "Births"                  
[5] "Total.Population"         "Birth.Rate"              
[7] "X..of.Total.Births"      
#merge birth_data and county_data
merged_data <- left_join(birth_data, county_data, by = "County.of.Residence.Code")
Warning in left_join(birth_data, county_data, by = "County.of.Residence.Code"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 11803 of `x` matches multiple rows in `y`.
ℹ Row 250 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472
#clean merged_data 
library(dplyr)

# clean merged data by removing unnecessary columns, checking discrepancies, and renaming
cleaned_data <- merged_data %>%
<<<<<<< HEAD
    # Remove 'Notes.x' column
    select(-Notes.x) %>%
    
    # Check for discrepancies between 'County.of.Residence.x' and 'County.of.Residence.y'
=======
    # Remove 'Notes.x' column
    select(-Notes.x) %>%
    
    # Check for discrepancies between 'County.of.Residence.x' and 'County.of.Residence.y'
>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472
    # 
    mutate(
        County.of.Residence = ifelse(
            County.of.Residence.x == County.of.Residence.y, 
            County.of.Residence.x, 
            NA  
        )
    ) %>%
    
<<<<<<< HEAD
    # Merge 'County.of.Residence.x' and 'County.of.Residence.y'
    select(-County.of.Residence.x, -County.of.Residence.y) %>%
    
    # Rename 'Births.y' to 'Total.Births'
=======
    # Merge 'County.of.Residence.x' and 'County.of.Residence.y'
    select(-County.of.Residence.x, -County.of.Residence.y) %>%
    
    # Rename 'Births.y' to 'Total.Births'
>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472
    rename(
        Total.Births = Births.y
    ) %>%
    
<<<<<<< HEAD
    # Remove 'Notes.y' column
=======
    # Remove 'Notes.y' column
>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472
    select(-Notes.y) %>%
  
    # Remove sex of infant code
    select(-Sex.of.Infant.Code) %>%
    
<<<<<<< HEAD
    # Remove 'X..of.Total.Births.y' and 'X..of.Total.Births.x'
    select(-c(X..of.Total.Births.x, X..of.Total.Births.y)) %>%
  
   # Remove 'OE.Gestational.Age.Recode.10.Code'
    select(-OE.Gestational.Age.Recode.10.Code) %>%
  
   # Remove 'NICU.Admission.Code'
    select(-NICU.Admission.Code) %>%
  
    # Filter out rows where 'NICU.Admission' or 'OE.Gestational.Age.Recode.10' are "Unknown"
    filter(
        !NICU.Admission %in% c("Unknown or Not Stated", ""),
        !OE.Gestational.Age.Recode.10 %in% c("Unknown or Not Stated", "")
=======
    # Remove 'X..of.Total.Births.y' and 'X..of.Total.Births.x'
    select(-c(X..of.Total.Births.x, X..of.Total.Births.y)) %>%
  
   # Remove 'OE.Gestational.Age.Recode.10.Code'
    select(-OE.Gestational.Age.Recode.10.Code) %>%
  
   # Remove 'NICU.Admission.Code'
    select(-NICU.Admission.Code) %>%
  
    # Filter out rows where 'NICU.Admission' or 'OE.Gestational.Age.Recode.10' are "Unknown"
    filter(
        !NICU.Admission %in% c("Unknown or Not Stated", ""),
        !OE.Gestational.Age.Recode.10 %in% c("Unknown or Not Stated", "")
>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472
    ) %>%
    
    # Remove any duplicates
    distinct()

cleaned_data <- cleaned_data %>%
  mutate(County.of.Residence.Code = as.character(County.of.Residence.Code))

<<<<<<< HEAD

# Convert 'County.of.Residence.Code' to character before merging
preterm_rate_data <- cleaned_data %>%
   mutate(
        Is.Preterm = OE.Gestational.Age.Recode.10 %in% c("28 - 31 weeks", "32 - 35 weeks", "36 weeks", "20 - 27 weeks", "Under 20 weeks"),
        County.of.Residence.Code = as.character(County.of.Residence.Code)  # Convert to character
    ) %>%
   group_by(County.of.Residence, County.of.Residence.Code) %>%  # Keep 'County.of.Residence.Code' in the group_by
    summarise(
        Preterm.Births = sum(ifelse(Is.Preterm, Births.x, 0), na.rm = TRUE),
        Preterm.Birth.Rate = (Preterm.Births / Total.Births) * 100
    ) %>%
   ungroup()
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
`summarise()` has grouped output by 'County.of.Residence',
'County.of.Residence.Code'. You can override using the `.groups` argument.
#load census shapefile data and merge by county FIPs 
counties_sf <- counties(cb = TRUE, resolution = "20m") # cb = TRUE for a simplified geometry
Retrieving data for the year 2022
# Merge the cleaned data with the counties shapefile
county_map_preterm_data <- left_join(counties_sf, preterm_rate_data, by = c("GEOID" = "County.of.Residence.Code"))

county_map_cleaned_data <- left_join(counties_sf, cleaned_data, by = c("GEOID" = "County.of.Residence.Code"))

# Ensure the 'Sex.of.Infant' column is available and that it's a factor or character
cleaned_data$Sex.of.Infant <- as.factor(cleaned_data$Sex.of.Infant)

To create a data frame to assess preterm birth rates, I stratified the preterm-birth data by sex and calculated the preterm birth rate for each county by dividing the number of preterm births by the total number of births in each county, multiplied by 100 to report as a percentage.

# Create a new data frame stratified by sex
preterm_sex_data <- cleaned_data %>%
  mutate(
    Is.Preterm = OE.Gestational.Age.Recode.10 %in% c("28 - 31 weeks", "32 - 35 weeks", "36 weeks", "20 - 27 weeks", "Under 20 weeks"),
    County.of.Residence.Code = as.character(County.of.Residence.Code)  # Convert to character
  ) %>%
  group_by(County.of.Residence, County.of.Residence.Code, Sex.of.Infant) %>%
  summarise(
    Preterm.Births = sum(ifelse(Is.Preterm, Births.x, 0), na.rm = TRUE),
    Preterm.Birth.Rate = (Preterm.Births / Total.Births) * 100
  ) %>%
  ungroup()
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
`summarise()` has grouped output by 'County.of.Residence',
'County.of.Residence.Code', 'Sex.of.Infant'. You can override using the
`.groups` argument.

Analysis plan

Exploratory analysis of data:

  • Generate county-level choropleth maps of birth rates ((births/total population)*1000 for each county).

  • Generate county-level choropleth maps of preterm birth rates (<37 gestational age births/total population).

  • Generate bar graphs showing sex-stratified distribution of births by gestational age groups.

  • Generate bar graphs showing sex-stratified distribution of NICU admission rates by gestational age groups.

  • Conduct logistic regression to identify relationships between gestational age and sex with the outcome of NICU admission

Additional data analysis: Identify counties with highest and lowest preterm birth rates, and the highest and lowest NICU admission rates.

1.4 Results

First, I developed county-level choropleth maps of birth rates (live births per 1000 people) for the 496 counties represented in this dataset. I used this output to identify the county with the highest birth rate (Rockland County, NY; 19.4) and the lowest birth rate (Charlotte County, FL; 4.99). The average birth rate was 10.81.

======= # View cleaned data head(cleaned_data)
  County.of.Residence.Code Sex.of.Infant NICU.Admission
1                    24033        Female             No
2                    24033          Male             No
3                    24037          Male            Yes
4                    24999        Female             No
5                    24999          Male            Yes
6                    25003        Female             No
  OE.Gestational.Age.Recode.10 Births.x Total.Births Total.Population
1                28 - 31 weeks       10        10527           947430
2             42 weeks or more       10        10527           947430
3                32 - 35 weeks       10         1309           115281
4                28 - 31 weeks       10         5546           550411
5                     40 weeks       10         5546           550411
6                     36 weeks       10          890           126818
  Birth.Rate        County.of.Residence
1      11.11 Prince George's County, MD
2      11.11 Prince George's County, MD
3      11.35      St. Mary's County, MD
4      10.08  Unidentified Counties, MD
5      10.08  Unidentified Counties, MD
6       7.02       Berkshire County, MA
# Convert 'County.of.Residence.Code' to character before merging
preterm_rate_data <- cleaned_data %>%
   mutate(
        Is.Preterm = OE.Gestational.Age.Recode.10 %in% c("28 - 31 weeks", "32 - 35 weeks", "36 weeks", "20 - 27 weeks", "Under 20 weeks"),
        County.of.Residence.Code = as.character(County.of.Residence.Code)  # Convert to character
    ) %>%
   group_by(County.of.Residence, County.of.Residence.Code) %>%  # Keep 'County.of.Residence.Code' in the group_by
    summarise(
        Total.Births = sum(Total.Births, na.rm = TRUE),
        Preterm.Births = sum(ifelse(Is.Preterm, Births.x, 0), na.rm = TRUE),
        Preterm.Birth.Rate = (Preterm.Births / Total.Births) * 100
    ) %>%
   ungroup()
`summarise()` has grouped output by 'County.of.Residence'. You can override
using the `.groups` argument.
# View the result to ensure 'County.of.Residence.Code' is retained
head(preterm_rate_data)
# A tibble: 6 × 5
  County.of.Residence County.of.Residence.Code Total.Births Preterm.Births
  <chr>               <chr>                           <int>          <dbl>
1 Ada County, ID      16001                          122064            408
2 Adams County, CO    8001                           139436            566
3 Adams County, PA    42001                            9130             26
4 Aiken County, SC    45003                           26698            169
5 Alachua County, FL  12001                           48317            266
6 Alamance County, NC 37001                           28320            158
# ℹ 1 more variable: Preterm.Birth.Rate <dbl>
#load census shapefile data and merge by county FIPs 
counties_sf <- counties(cb = TRUE, resolution = "20m") # cb = TRUE for a simplified geometry
Retrieving data for the year 2022
# Merge the cleaned data with the counties shapefile
county_map_preterm_data <- left_join(counties_sf, preterm_rate_data, by = c("GEOID" = "County.of.Residence.Code"))

county_map_cleaned_data <- left_join(counties_sf, cleaned_data, by = c("GEOID" = "County.of.Residence.Code"))

# Integrate with US Census Data

Analysis plan

Exploratory analysis of data: Generate county-level choropleth maps of birth rates ((births/total population)*1000 for each county). Generate county-level choropleth maps of preterm birth rates (<37 gestational age births/total population). Generate sex-stratified maps of preterm birth rates. Generate sex-stratified choropleth maps of NICU admissions by county. Generate bar graphs showing NICU admission rates by sex and sex distribution of gestational age groups.

Data analysis: Conduct 2-way ANOVA to determine if sex as a biological variable is a significant predictor of NICU admission. Identify counties with highest and lowest preterm birth rates, and the highest and lowest NICU admission rates.

1.4 Results

First, I developed a county-level choropleth maps of birth rates (live births per 1000 people) for counties that did not have any missing data. The below map represents 496 counties. Rockland County, NY has the highest birth rate (19.4 per 1000 people) while Charlotte County, FL has the lowest birth rate (4.99 per 1000 people).

>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472
#load necessary libraries
library(tidyverse)
library(sf)
library(tidycensus)
library(ggspatial)
library(leaflet)
library(cowplot)

<<<<<<< HEAD
Attaching package: 'cowplot'
The following object is masked from 'package:lubridate':
=======
Attaching package: 'cowplot'
The following object is masked from 'package:lubridate':
>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472

    stamp
library(nhanesA)
library(haven)
library(viridis)
Loading required package: viridisLite
# Convert Birth.Rate to numeric (if necessary)
county_map_cleaned_data$Birth.Rate <- as.numeric(county_map_cleaned_data$Birth.Rate)

# Recheck the structure of the column
str(county_map_cleaned_data$Birth.Rate)
 num [1:11636] NA NA NA NA NA ...
# Remove rows where Birth.Rate is NA
county_map_cleaned_data <- county_map_cleaned_data %>%
  filter(!is.na(Birth.Rate))

#load map theme
map_theme <- function() {
  theme_minimal() +  
    theme(
      axis.line = element_blank(), 
      axis.text = element_blank(),  
      axis.title = element_blank(),
<<<<<<< HEAD
      panel.grid = element_line(color = "white"), 
      legend.key.size = unit(0.8, "cm"),     
=======
      panel.grid = element_line(color = "white"), 
      legend.key.size = unit(0.8, "cm"),     
>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472
      legend.text = element_text(size = 16),   
      legend.title = element_text(size = 16)
    )
}

#create my palette
myPalette <- colorRampPalette(viridis(100))

# Generate the choropleth map for birth rates
birth_rates_map <- ggplot(data = county_map_cleaned_data) +
  geom_sf(aes(fill = Birth.Rate)) +  # Map Preterm Birth Rate to the fill
  map_theme() +  # Apply the custom theme
<<<<<<< HEAD
  ggtitle("2023 Birth Rate (live births per 1000 people)") +
  scale_fill_gradientn(name = "Birth Rate\nrate", colours = myPalette(100))
=======
  ggtitle("2023 Birth Rate (live births per 1000 people)") +
  scale_fill_gradientn(name = "Birth Rate\nrate", colours = myPalette(100))
>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472
  coord_sf(xlim = c(-79.5, -75), ylim = c(37, 39))
<ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
    aspect: function
    backtransform_range: function
    clip: on
    crs: NULL
    datum: crs
    default: FALSE
    default_crs: NULL
    determine_crs: function
    distance: function
    expand: TRUE
    fixup_graticule_labels: function
    get_default_crs: function
    is_free: function
    is_linear: function
    label_axes: list
    label_graticule: 
    labels: function
    limits: list
    lims_method: cross
    modify_scales: function
    ndiscr: 100
    params: list
    range: function
    record_bbox: function
    render_axis_h: function
    render_axis_v: function
    render_bg: function
    render_fg: function
    setup_data: function
    setup_layout: function
    setup_panel_guides: function
    setup_panel_params: function
    setup_params: function
    train_panel_guides: function
    transform: function
    super:  <ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
# Print the map
print(birth_rates_map)
<<<<<<< HEAD

=======

>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472
#How many counties are represented in this analysis?
num_unique_geoid_counties <- length(unique(county_map_cleaned_data$GEOID))
print(num_unique_geoid_counties)
[1] 496
#Which counties have the highest and lowest birth rates?
# Find the county with the highest Birth Rate and its value (show only the first entry)
highest_birth_rate <- county_map_cleaned_data %>%
  filter(Birth.Rate == max(Birth.Rate, na.rm = TRUE)) %>%
  slice(1)  # Select the first row

<<<<<<< HEAD
cat("County with the highest Birth Rate: ", highest_birth_rate$County.of.Residence, "\n")
County with the highest Birth Rate:  Rockland County, NY 
cat("Highest Birth Rate: ", highest_birth_rate$Birth.Rate, "\n")
======= cat("County with the highest Birth Rate: ", highest_birth_rate$County.of.Residence, "\n")
County with the highest Birth Rate:  Rockland County, NY 
cat("Highest Birth Rate: ", highest_birth_rate$Birth.Rate, "\n")
>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472
Highest Birth Rate:  19.4 
# Find the county with the lowest Birth Rate and its value (show only the first entry)
<<<<<<< HEAD

lowest_birth_rate <- county_map_cleaned_data %>%
  filter(Birth.Rate == min(Birth.Rate, na.rm = TRUE)) %>%
  slice(1)  # Select the first row

cat("County with the lowest Birth Rate: ", lowest_birth_rate$County.of.Residence, "\n")
County with the lowest Birth Rate:  Charlotte County, FL 
cat("Lowest Birth Rate: ", lowest_birth_rate$Birth.Rate, "\n")
Lowest Birth Rate:  4.99 
# Calculate the average Birth Rate
average_birth_rate <- county_map_cleaned_data %>%
  summarise(Average_Birth_Rate = mean(Birth.Rate, na.rm = TRUE))

cat("Average Birth Rate: ", average_birth_rate$Average_Birth_Rate, "\n")
Average Birth Rate:  10.81442 

To better visualize and explore the birth rates, I then generated an interactive choropleth map of birth rates by county. This made it easier to identify regional trends and easily navigate to counties of interest such as Philadelphia County. Philadelphia County has a birth rate above average at 11.8 live births per 1000 people. My hometown in Hamilton County, TN has a similar birth rate at 11.9 live births per 1000 people.

#generate interactive choropleth map 
library(leaflet)
library(RColorBrewer)

# Define the color scale for the Birth Rate using colorBin
pal_fun <- colorBin(palette = brewer.pal(9, "YlOrRd")[c(1:5, 7)], 
                    bins = c(0, 5, 10, 15, 20, 25), 
                    reverse = FALSE)

# Create the popup message for each county
pu_message <- paste0(county_map_cleaned_data$County.of.Residence, 
                     "<br>Birth Rate: ",      
                     round(county_map_cleaned_data$Birth.Rate, 1), " live births per 1000 people")

# Create the leaflet map with the polygons
leaflet(county_map_cleaned_data) |>
  addPolygons(stroke = FALSE,               
              fillColor = ~pal_fun(Birth.Rate),
              fillOpacity = 0.7, smoothFactor = 0.5,
              popup = pu_message) |>   
  addProviderTiles(providers$CartoDB.Positron) |>   
  addLegend("bottomright",               
            pal = pal_fun,              
            values = ~Birth.Rate,  
            title = 'Birth Rate',    
            opacity = 1) |>             
  addScaleBar()
Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
Need '+proj=longlat +datum=WGS84'

I repeated the analysis to generate county-level choropleth maps of preterm birth rates ((<37 gestational age births/total number of live births)*100). The average preterm birth rate of the counties in this dataset is 9.5%. Hinds County, MS has the highest preterm birth rate (17.85%) while Tompkins County, NY has the lowest preterm birth rate (2.63%).

# Convert Preterm.Birth.Rate to numeric
county_map_preterm_data$Preterm.Birth.Rate <- as.numeric(county_map_preterm_data$Preterm.Birth.Rate)

# Recheck the structure of the column
str(county_map_preterm_data$Preterm.Birth.Rate)
 num [1:11636] NA NA NA NA NA ...
# Remove rows where Birth.Rate is NA
county_map_preterm_data <- county_map_preterm_data %>%
  filter(!is.na(Preterm.Birth.Rate))

# Generate the choropleth map for preterm birth rates
preterm_birth_rates_map <- ggplot(data = county_map_preterm_data) +
  geom_sf(aes(fill = Preterm.Birth.Rate)) +  # Map Preterm Birth Rate to the fill
  map_theme() +  # Apply the custom theme
  ggtitle("2023 Preterm Birth Rate % (<37 gestational age births/total number of live births)") +
  scale_fill_gradientn(name = "Preterm Birth Rate\nrate", colours = myPalette(100))
  coord_sf(xlim = c(-79.5, -75), ylim = c(37, 39))
======= lowest_birth_rate <- county_map_cleaned_data %>% filter(Birth.Rate == min(Birth.Rate, na.rm = TRUE)) %>% slice(1) # Select the first row cat("County with the lowest Birth Rate: ", lowest_birth_rate$County.of.Residence, "\n")
County with the lowest Birth Rate:  Charlotte County, FL 
cat("Lowest Birth Rate: ", lowest_birth_rate$Birth.Rate, "\n")
Lowest Birth Rate:  4.99 

To better visualize and explore the birth rates, I then generated an interactive choropleth map of birth rates by county.

#generate interactive choropleth map 
library(leaflet)
library(RColorBrewer)

# Define the color scale for the Birth Rate using colorBin
pal_fun <- colorBin(palette = brewer.pal(9, "YlOrRd")[c(1:5, 7)], 
                    bins = c(0, 5, 10, 15, 20, 25), 
                    reverse = FALSE)

# Create the popup message for each county
pu_message <- paste0(county_map_cleaned_data$County.of.Residence, 
                     "<br>Birth Rate: ",      
                     round(county_map_cleaned_data$Birth.Rate, 1), " live births per 1000 people")

# Create the leaflet map with the polygons
leaflet(county_map_cleaned_data) |>
  addPolygons(stroke = FALSE,               
              fillColor = ~pal_fun(Birth.Rate),
              fillOpacity = 0.7, smoothFactor = 0.5,
              popup = pu_message) |>   
  addProviderTiles(providers$CartoDB.Positron) |>   
  addLegend("bottomright",               
            pal = pal_fun,              
            values = ~Birth.Rate,  
            title = 'Birth Rate',    
            opacity = 1) |>             
  addScaleBar()
Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
Need '+proj=longlat +datum=WGS84'

I repeated the analysis to generate county-level choropleth maps of preterm birth rates (<37 gestational age births/total number of live births).

# Convert Preterm.Birth.Rate to numeric
county_map_preterm_data$Preterm.Birth.Rate <- as.numeric(county_map_preterm_data$Preterm.Birth.Rate)

# Recheck the structure of the column
str(county_map_preterm_data$Preterm.Birth.Rate)
 num [1:3222] NA NA NA NA NA ...
# Remove rows where Birth.Rate is NA
county_map_preterm_data <- county_map_preterm_data %>%
  filter(!is.na(Preterm.Birth.Rate))

# Generate the choropleth map for preterm birth rates
pretetrm_birth_rates_map <- ggplot(data = county_map_preterm_data) +
  geom_sf(aes(fill = Preterm.Birth.Rate)) +  # Map Preterm Birth Rate to the fill
  map_theme() +  # Apply the custom theme
  ggtitle("2023 Preterm Birth Rate % (<37 gestational age births/total number of live births)") +
  scale_fill_gradientn(name = "Preterm Birth Rate\nrate", colours = myPalette(100))
  coord_sf(xlim = c(-79.5, -75), ylim = c(37, 39))
>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472
<ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
    aspect: function
    backtransform_range: function
    clip: on
    crs: NULL
    datum: crs
    default: FALSE
    default_crs: NULL
    determine_crs: function
    distance: function
    expand: TRUE
    fixup_graticule_labels: function
    get_default_crs: function
    is_free: function
    is_linear: function
    label_axes: list
    label_graticule: 
    labels: function
    limits: list
    lims_method: cross
    modify_scales: function
    ndiscr: 100
    params: list
    range: function
    record_bbox: function
    render_axis_h: function
    render_axis_v: function
    render_bg: function
    render_fg: function
    setup_data: function
    setup_layout: function
    setup_panel_guides: function
    setup_panel_params: function
    setup_params: function
    train_panel_guides: function
    transform: function
    super:  <ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
<<<<<<< HEAD
preterm_birth_rates_map

#How many counties are represented in this analysis?
num_unique_geoid_counties_preterm <- length(unique(county_map_preterm_data$GEOID))
print(num_unique_geoid_counties_preterm)
[1] 496
#Which counties have the highest and lowest preterm birth rates?
# Find the county with the highest preterm Birth Rate and its value (show only the first entry)
highest_preterm_birth_rate <- county_map_preterm_data %>%
  filter(Preterm.Birth.Rate == max(Preterm.Birth.Rate, na.rm = TRUE)) %>%
  slice(1)  # Select the first row

cat("County with the highest preterm Birth Rate: ", highest_preterm_birth_rate$County.of.Residence, "\n")
County with the highest preterm Birth Rate:  Hinds County, MS 
cat("Highest preterm Birth Rate: ", highest_preterm_birth_rate$Preterm.Birth.Rate, "\n")
Highest preterm Birth Rate:  17.85029 
# Find the county with the lowest preterm Birth Rate and its value (show only the first entry)
lowest_preterm_birth_rate <- county_map_preterm_data %>%
  filter(Preterm.Birth.Rate == min(Preterm.Birth.Rate, na.rm = TRUE)) %>%
  slice(1)  # Select the first row

cat("County with the lowest preterm Birth Rate: ", lowest_preterm_birth_rate$County.of.Residence, "\n")
County with the lowest preterm Birth Rate:  Tompkins County, NY 
cat("Lowest preterm Birth Rate: ", lowest_preterm_birth_rate$Preterm.Birth.Rate, "\n")
Lowest preterm Birth Rate:  2.631579 
#What is the average preterm birth rate? 
# Calculate the average preterm birth rate
average_preterm_birth_rate <- mean(county_map_preterm_data$Preterm.Birth.Rate, na.rm = TRUE)

cat("Average preterm birth rate: ", round(average_preterm_birth_rate, 2), "\n")
Average preterm birth rate:  9.5 

I now want a similarly interactive map of preterm birth data. From this, I can identify that Philadelphia has an above average preterm birth rate at 11%. My hometown in Hamilton County, TN has a similar preterm birth rate at 10.6%.

#generate interactive choropleth map 
library(leaflet)
library(RColorBrewer)

# Define the color scale for the Birth Rate using colorBin
pal_fun <- colorBin(palette = brewer.pal(9, "YlOrRd")[c(1:5, 7)], 
                    bins = c(0, 5, 10, 15, 20, 25), 
                    reverse = FALSE)

county_map_preterm_data$Preterm.Birth.Rate <- as.numeric(county_map_preterm_data$Preterm.Birth.Rate)

# Create the popup message for each county
pu_message <- paste0(county_map_preterm_data$County.of.Residence, 
                     "<br>Preterm Birth Rate: ",      
                     round(county_map_preterm_data$Preterm.Birth.Rate, 1), "%")

# Create the leaflet map with the polygons
leaflet(county_map_preterm_data) |>
  addPolygons(stroke = FALSE,               
              fillColor = ~pal_fun(Preterm.Birth.Rate),
              fillOpacity = 0.7, smoothFactor = 0.5,
              popup = pu_message) |>   
  addProviderTiles(providers$CartoDB.Positron) |>   
  addLegend("bottomright",               
            pal = pal_fun,              
            values = ~Preterm.Birth.Rate,  
            title = 'Preterm Birth Rate',    
            opacity = 1) |>             
  addScaleBar()
Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
Need '+proj=longlat +datum=WGS84'

I now want to interrogate the total number of births by gestational age group and visualize the distribution using a bar graph. I can see that most of the infants are born at 32-35 weeks and 37-39 weeks gestational age. The <20 week gestational age group is least represented.

# Summarize the data by gestational age for total births
gestational_births_summary <- cleaned_data %>%
  group_by(OE.Gestational.Age.Recode.10) %>%
  summarise(
    Total_Births = n()  # Count the total number of births in each gestational age category
  ) %>%
  ungroup()

# Plot the total number of births by gestational age
ggplot(gestational_births_summary, aes(x = OE.Gestational.Age.Recode.10, y = Total_Births, fill = OE.Gestational.Age.Recode.10)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(
    title = "Total Number of Births by Gestational Age",
    x = "Gestational Age at Birth",
    y = "Total Number of Births"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Now I want to explore the proportion of NICU admission by gestational age. We can see a general trend where lower gestational ages have a higher proportion of NICU admission. Notably, infants less than 20 weeks old do not follow this trend, though this may represent survivor bias (infants born less than 20 weeks old may not survive for very long after birth or are inadequately powered to be assessed in this analysis). Infants born 20-31 weeks gestational age had the highest proportion of NICU admission.

# Summarizing the data by gestational age and NICU admission status
gestational_nicu_summary <- cleaned_data %>%
  group_by(OE.Gestational.Age.Recode.10) %>%
  summarise(
    Total_Births = n(),  # Count the total number of births in each gestational age category (both Yes and No)
    NICU_Admissions = sum(NICU.Admission == "Yes", na.rm = TRUE),  # Count NICU admissions
    Proportion_NICU_Admission = NICU_Admissions / Total_Births  # Calculate proportion of NICU admissions
  ) %>%
  ungroup()

# Now, plot the proportion of NICU admissions by gestational age
library(ggplot2)

ggplot(gestational_nicu_summary, aes(x = OE.Gestational.Age.Recode.10, y = Proportion_NICU_Admission, fill = OE.Gestational.Age.Recode.10)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(
    title = "Proportion of NICU Admissions by Gestational Age",
    x = "Gestational Age at Birth",
    y = "Proportion of NICU Admissions"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

I repeated this analysis stratified by sex, first by visualizing the number of births per gestational age.

gestational_nicu_sex_summary <- cleaned_data %>%
  group_by(OE.Gestational.Age.Recode.10, Sex.of.Infant) %>%
  summarise(
    Total_Births = n()  # Count the total number of infants for each gestational age and sex
  ) %>%
  ungroup()
`summarise()` has grouped output by 'OE.Gestational.Age.Recode.10'. You can
override using the `.groups` argument.
# Print the summarized data to check
print(gestational_nicu_sex_summary)
# A tibble: 18 × 3
   OE.Gestational.Age.Recode.10 Sex.of.Infant Total_Births
   <chr>                        <fct>                <int>
 1 20 - 27 weeks                Female                 278
 2 20 - 27 weeks                Male                   315
 3 28 - 31 weeks                Female                 399
 4 28 - 31 weeks                Male                   430
 5 32 - 35 weeks                Female                1102
 6 32 - 35 weeks                Male                  1127
 7 36 weeks                     Female                 957
 8 36 weeks                     Male                  1042
 9 37 - 39 weeks                Female                1221
10 37 - 39 weeks                Male                  1235
11 40 weeks                     Female                 892
12 40 weeks                     Male                   937
13 41 weeks                     Female                 720
14 41 weeks                     Male                   745
15 42 weeks or more             Female                 110
16 42 weeks or more             Male                   134
17 Under 20 weeks               Female                   3
18 Under 20 weeks               Male                     6
library(ggplot2)

ggplot(gestational_nicu_sex_summary, aes(x = OE.Gestational.Age.Recode.10, y = Total_Births, fill = Sex.of.Infant)) +
  geom_bar(stat = "identity", position = "dodge") +  # Use 'dodge' for separate bars for each sex
  labs(
    title = "Number of Infants Born by Gestational Age and Sex",
    x = "Gestational Age at Birth",
    y = "Total Number of Infants"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

I then stratified this data by sex to visualize gestational age and proportion of NICU admissions.

gestational_nicu_sex_summary <- cleaned_data %>%
  group_by(OE.Gestational.Age.Recode.10, Sex.of.Infant) %>%
  summarise(
    Total_Births = n(),  # Count the total number of births in each gestational age category and sex
    NICU_Admissions = sum(NICU.Admission == "Yes", na.rm = TRUE),  # Count NICU admissions
    Proportion_NICU_Admission = NICU_Admissions / Total_Births  # Calculate proportion of NICU admissions
  ) %>%
  ungroup()
`summarise()` has grouped output by 'OE.Gestational.Age.Recode.10'. You can
override using the `.groups` argument.
# Print the summarized data to check
print(gestational_nicu_sex_summary)
# A tibble: 18 × 5
   OE.Gestational.Age.Recode.10 Sex.of.Infant Total_Births NICU_Admissions
   <chr>                        <fct>                <int>           <int>
 1 20 - 27 weeks                Female                 278             226
 2 20 - 27 weeks                Male                   315             258
 3 28 - 31 weeks                Female                 399             360
 4 28 - 31 weeks                Male                   430             391
 5 32 - 35 weeks                Female                1102             609
 6 32 - 35 weeks                Male                  1127             618
 7 36 weeks                     Female                 957             337
 8 36 weeks                     Male                  1042             425
 9 37 - 39 weeks                Female                1221             595
10 37 - 39 weeks                Male                  1235             609
11 40 weeks                     Female                 892             266
12 40 weeks                     Male                   937             311
13 41 weeks                     Female                 720             108
14 41 weeks                     Male                   745             140
15 42 weeks or more             Female                 110               0
16 42 weeks or more             Male                   134               1
17 Under 20 weeks               Female                   3               0
18 Under 20 weeks               Male                     6               0
# ℹ 1 more variable: Proportion_NICU_Admission <dbl>
library(ggplot2)

ggplot(gestational_nicu_sex_summary, aes(x = OE.Gestational.Age.Recode.10, y = Proportion_NICU_Admission, fill = Sex.of.Infant)) +
  geom_bar(stat = "identity", position = "dodge") +  # Use 'dodge' for separate bars for each sex
  labs(
    title = "Proportion of NICU Admissions by Gestational Age and Sex",
    x = "Gestational Age at Birth",
    y = "Proportion of NICU Admissions"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

I then ran a logistic regression with gestational age only. Each variable was significant except for gestational age <20 weeks. The <20 weeks age group also has a high standard error (108.2). Moving forward, this will be excluded in the analysis.

# Convert NICU.Admission to binary (1 for "Yes", 0 for "No")
 cleaned_data$NICU.Admission <- ifelse(cleaned_data$NICU.Admission == "Yes", 1, 0)
  
 # Fit the logistic regression model again
 logistic_model <- glm(NICU.Admission ~ OE.Gestational.Age.Recode.10,  
                      data = cleaned_data, 
                      family = "binomial")

# Summary of the logistic regression model
summary(logistic_model)

Call:
glm(formula = NICU.Admission ~ OE.Gestational.Age.Recode.10, 
    family = "binomial", data = cleaned_data)

Coefficients:
                                             Estimate Std. Error z value
(Intercept)                                    1.4907     0.1060  14.061
OE.Gestational.Age.Recode.1028 - 31 weeks      0.7740     0.1594   4.857
OE.Gestational.Age.Recode.1032 - 35 weeks     -1.2882     0.1143 -11.275
OE.Gestational.Age.Recode.1036 weeks          -1.9752     0.1156 -17.088
OE.Gestational.Age.Recode.1037 - 39 weeks     -1.5298     0.1134 -13.485
OE.Gestational.Age.Recode.1040 weeks          -2.2654     0.1174 -19.304
OE.Gestational.Age.Recode.1041 weeks          -3.0815     0.1269 -24.289
OE.Gestational.Age.Recode.1042 weeks or more  -6.9838     1.0076  -6.931
OE.Gestational.Age.Recode.10Under 20 weeks   -14.0568   108.2480  -0.130
                                             Pr(>|z|)    
(Intercept)                                   < 2e-16 ***
OE.Gestational.Age.Recode.1028 - 31 weeks    1.19e-06 ***
OE.Gestational.Age.Recode.1032 - 35 weeks     < 2e-16 ***
OE.Gestational.Age.Recode.1036 weeks          < 2e-16 ***
OE.Gestational.Age.Recode.1037 - 39 weeks     < 2e-16 ***
OE.Gestational.Age.Recode.1040 weeks          < 2e-16 ***
OE.Gestational.Age.Recode.1041 weeks          < 2e-16 ***
OE.Gestational.Age.Recode.1042 weeks or more 4.19e-12 ***
OE.Gestational.Age.Recode.10Under 20 weeks      0.897    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16042  on 11652  degrees of freedom
Residual deviance: 13837  on 11644  degrees of freedom
AIC: 13855

Number of Fisher Scoring iterations: 11

I am repeating the analysis to exclude <20 weeks gestational age.

cleaned_data$OE.Gestational.Age.Recode.10 <- factor(cleaned_data$OE.Gestational.Age.Recode.10, 
                                                    levels = c("40 weeks",
                                                               "20 - 27 weeks",
                                                               "28 - 31 weeks", 
                                                               "32 - 35 weeks", 
                                                               "36 weeks", 
                                                               "37 - 39 weeks", 
                                                               "41 weeks", 
                                                               "42 weeks or more"))
 
  # Fit the logistic regression model again
 logistic_model <- glm(NICU.Admission ~ OE.Gestational.Age.Recode.10,  
                      data = cleaned_data, 
                      family = "binomial")

# Summary of the logistic regression model
summary(logistic_model)

Call:
glm(formula = NICU.Admission ~ OE.Gestational.Age.Recode.10, 
    family = "binomial", data = cleaned_data)

Coefficients:
                                             Estimate Std. Error z value
(Intercept)                                  -0.77466    0.05032 -15.395
OE.Gestational.Age.Recode.1020 - 27 weeks     2.26539    0.11736  19.304
OE.Gestational.Age.Recode.1028 - 31 weeks     3.03935    0.12917  23.531
OE.Gestational.Age.Recode.1032 - 35 weeks     0.97723    0.06592  14.826
OE.Gestational.Age.Recode.1036 weeks          0.29016    0.06821   4.254
OE.Gestational.Age.Recode.1037 - 39 weeks     0.73556    0.06451  11.403
OE.Gestational.Age.Recode.1041 weeks         -0.81606    0.08594  -9.496
OE.Gestational.Age.Recode.1042 weeks or more -4.71840    1.00135  -4.712
                                             Pr(>|z|)    
(Intercept)                                   < 2e-16 ***
OE.Gestational.Age.Recode.1020 - 27 weeks     < 2e-16 ***
OE.Gestational.Age.Recode.1028 - 31 weeks     < 2e-16 ***
OE.Gestational.Age.Recode.1032 - 35 weeks     < 2e-16 ***
OE.Gestational.Age.Recode.1036 weeks         2.10e-05 ***
OE.Gestational.Age.Recode.1037 - 39 weeks     < 2e-16 ***
OE.Gestational.Age.Recode.1041 weeks          < 2e-16 ***
OE.Gestational.Age.Recode.1042 weeks or more 2.45e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16031  on 11643  degrees of freedom
Residual deviance: 13837  on 11636  degrees of freedom
  (9 observations deleted due to missingness)
AIC: 13853

Number of Fisher Scoring iterations: 7

Using these data, I then calculated odds ratios for gestational age on the outcome of NICU admission.

# Extract model coefficients
coefficients <- summary(logistic_model)$coefficients

# Calculate odds ratios (exponentiate the coefficients)
odds_ratios <- exp(coefficients[, "Estimate"])
lower_ci <- exp(coefficients[, "Estimate"] - 1.96 * coefficients[, "Std. Error"])
upper_ci <- exp(coefficients[, "Estimate"] + 1.96 * coefficients[, "Std. Error"])

# Add confidence intervals to the data frame
odds_ratios_df <- data.frame(
  Gestational_Age = rownames(coefficients),
  Odds_Ratio = odds_ratios,
  Lower_CI = lower_ci,
  Upper_CI = upper_ci
)

# Remove the intercept row
odds_ratios_df <- odds_ratios_df[odds_ratios_df$Gestational_Age != "(Intercept)", ]

print(odds_ratios_df)
                                                                          Gestational_Age
OE.Gestational.Age.Recode.1020 - 27 weeks       OE.Gestational.Age.Recode.1020 - 27 weeks
OE.Gestational.Age.Recode.1028 - 31 weeks       OE.Gestational.Age.Recode.1028 - 31 weeks
OE.Gestational.Age.Recode.1032 - 35 weeks       OE.Gestational.Age.Recode.1032 - 35 weeks
OE.Gestational.Age.Recode.1036 weeks                 OE.Gestational.Age.Recode.1036 weeks
OE.Gestational.Age.Recode.1037 - 39 weeks       OE.Gestational.Age.Recode.1037 - 39 weeks
OE.Gestational.Age.Recode.1041 weeks                 OE.Gestational.Age.Recode.1041 weeks
OE.Gestational.Age.Recode.1042 weeks or more OE.Gestational.Age.Recode.1042 weeks or more
                                               Odds_Ratio     Lower_CI
OE.Gestational.Age.Recode.1020 - 27 weeks     9.634903725  7.655130241
OE.Gestational.Age.Recode.1028 - 31 weeks    20.891703328 16.219049568
OE.Gestational.Age.Recode.1032 - 35 weeks     2.657084445  2.335055123
OE.Gestational.Age.Recode.1036 weeks          1.336637950  1.169371461
OE.Gestational.Age.Recode.1037 - 39 weeks     2.086655113  1.838829589
OE.Gestational.Age.Recode.1041 weeks          0.442170351  0.373624684
OE.Gestational.Age.Recode.1042 weeks or more  0.008929469  0.001254469
                                                Upper_CI
OE.Gestational.Age.Recode.1020 - 27 weeks    12.12668719
OE.Gestational.Age.Recode.1028 - 31 weeks    26.91053296
OE.Gestational.Age.Recode.1032 - 35 weeks     3.02352509
OE.Gestational.Age.Recode.1036 weeks          1.52783018
OE.Gestational.Age.Recode.1037 - 39 weeks     2.36788095
OE.Gestational.Age.Recode.1041 weeks          0.52329149
OE.Gestational.Age.Recode.1042 weeks or more  0.06356108
library(ggplot2)

ggplot(odds_ratios_df, aes(x = reorder(Gestational_Age, Odds_Ratio), y = Odds_Ratio)) +
  geom_pointrange(aes(ymin = Lower_CI, ymax = Upper_CI), 
                  color = "blue", 
                  shape = 16) +  # Add points with error bars
  coord_flip() +  # Flip axes for readability
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +  # Reference line at OR = 1
  labs(
    title = "Odds Ratios for NICU Admission",
    x = "Variable",
    y = "Odds Ratio (OR)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Improve label readability

Logistic regression with gestational age and sex

# Fit the logistic regression model with Gestational age and Sex of infant
logistic_model <- glm(NICU.Admission ~ OE.Gestational.Age.Recode.10 + Sex.of.Infant, 
                      data = cleaned_data, 
                      family = "binomial")

# Show the model summary
summary(logistic_model)

Call:
glm(formula = NICU.Admission ~ OE.Gestational.Age.Recode.10 + 
    Sex.of.Infant, family = "binomial", data = cleaned_data)

Coefficients:
                                             Estimate Std. Error z value
(Intercept)                                  -0.82738    0.05466 -15.136
OE.Gestational.Age.Recode.1020 - 27 weeks     2.26476    0.11738  19.294
OE.Gestational.Age.Recode.1028 - 31 weeks     3.04023    0.12919  23.534
OE.Gestational.Age.Recode.1032 - 35 weeks     0.97852    0.06594  14.840
OE.Gestational.Age.Recode.1036 weeks          0.28941    0.06823   4.242
OE.Gestational.Age.Recode.1037 - 39 weeks     0.73698    0.06453  11.421
OE.Gestational.Age.Recode.1041 weeks         -0.81606    0.08596  -9.494
OE.Gestational.Age.Recode.1042 weeks or more -4.72295    1.00135  -4.717
Sex.of.InfantMale                             0.10198    0.04087   2.495
                                             Pr(>|z|)    
(Intercept)                                   < 2e-16 ***
OE.Gestational.Age.Recode.1020 - 27 weeks     < 2e-16 ***
OE.Gestational.Age.Recode.1028 - 31 weeks     < 2e-16 ***
OE.Gestational.Age.Recode.1032 - 35 weeks     < 2e-16 ***
OE.Gestational.Age.Recode.1036 weeks         2.22e-05 ***
OE.Gestational.Age.Recode.1037 - 39 weeks     < 2e-16 ***
OE.Gestational.Age.Recode.1041 weeks          < 2e-16 ***
OE.Gestational.Age.Recode.1042 weeks or more 2.40e-06 ***
Sex.of.InfantMale                              0.0126 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16031  on 11643  degrees of freedom
Residual deviance: 13831  on 11635  degrees of freedom
  (9 observations deleted due to missingness)
AIC: 13849

Number of Fisher Scoring iterations: 7
# Extract model coefficients
coefficients <- summary(logistic_model)$coefficients

# Calculate odds ratios (exponentiate the coefficients)
odds_ratios <- exp(coefficients[, "Estimate"])
lower_ci <- exp(coefficients[, "Estimate"] - 1.96 * coefficients[, "Std. Error"])
upper_ci <- exp(coefficients[, "Estimate"] + 1.96 * coefficients[, "Std. Error"])

# Create a data frame of the odds ratios
odds_ratios_df <- data.frame(
  Variable = rownames(coefficients),
  Odds_Ratio = odds_ratios,
  Lower_CI = lower_ci,
  Upper_CI = upper_ci
)

# Remove the intercept row
odds_ratios_df <- odds_ratios_df[odds_ratios_df$Variable != "(Intercept)", ]

# Print the odds ratios data frame
print(odds_ratios_df)
                                                                                 Variable
OE.Gestational.Age.Recode.1020 - 27 weeks       OE.Gestational.Age.Recode.1020 - 27 weeks
OE.Gestational.Age.Recode.1028 - 31 weeks       OE.Gestational.Age.Recode.1028 - 31 weeks
OE.Gestational.Age.Recode.1032 - 35 weeks       OE.Gestational.Age.Recode.1032 - 35 weeks
OE.Gestational.Age.Recode.1036 weeks                 OE.Gestational.Age.Recode.1036 weeks
OE.Gestational.Age.Recode.1037 - 39 weeks       OE.Gestational.Age.Recode.1037 - 39 weeks
OE.Gestational.Age.Recode.1041 weeks                 OE.Gestational.Age.Recode.1041 weeks
OE.Gestational.Age.Recode.1042 weeks or more OE.Gestational.Age.Recode.1042 weeks or more
Sex.of.InfantMale                                                       Sex.of.InfantMale
                                               Odds_Ratio     Lower_CI
OE.Gestational.Age.Recode.1020 - 27 weeks     9.628855056  7.649949217
OE.Gestational.Age.Recode.1028 - 31 weeks    20.910140346 16.232686086
OE.Gestational.Age.Recode.1032 - 35 weeks     2.660523589  2.337967331
OE.Gestational.Age.Recode.1036 weeks          1.335645279  1.168456533
OE.Gestational.Age.Recode.1037 - 39 weeks     2.089616496  1.841354187
OE.Gestational.Age.Recode.1041 weeks          0.442172487  0.373612802
OE.Gestational.Age.Recode.1042 weeks or more  0.008888905  0.001248768
Sex.of.InfantMale                             1.107361060  1.022108097
                                                Upper_CI
OE.Gestational.Age.Recode.1020 - 27 weeks    12.11966865
OE.Gestational.Age.Recode.1028 - 31 weeks    26.93540471
OE.Gestational.Age.Recode.1032 - 35 weeks     3.02758113
OE.Gestational.Age.Recode.1036 weeks          1.52675625
OE.Gestational.Age.Recode.1037 - 39 weeks     2.37135100
OE.Gestational.Age.Recode.1041 weeks          0.52331319
OE.Gestational.Age.Recode.1042 weeks or more  0.06327246
Sex.of.InfantMale                             1.19972488
library(ggplot2)

ggplot(odds_ratios_df, aes(x = reorder(Variable, Odds_Ratio), y = Odds_Ratio)) +
  geom_pointrange(aes(ymin = Lower_CI, ymax = Upper_CI), 
                  color = "blue", 
                  shape = 16) +  # Add points with error bars
  coord_flip() +  # Flip axes for readability
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +  # Reference line at OR = 1
  labs(
    title = "Odds Ratios for NICU Admission",
    x = "Variable",
    y = "Odds Ratio (OR)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Improve label readability

Are males more likely to be preterm?

# Create binary variable for preterm (1) vs term (0)
cleaned_data$Preterm_Binary <- ifelse(cleaned_data$OE.Gestational.Age.Recode.10 %in% c("Under 20 weeks", "20 - 27 weeks", "28 - 31 weeks", "32 - 35 weeks", "36 weeks"), 1, 0)

# Logistic regression to test if males are more likely to be preterm
logistic_model <- glm(Preterm_Binary ~ Sex.of.Infant, data = cleaned_data, family = "binomial")
summary(logistic_model)

Call:
glm(formula = Preterm_Binary ~ Sex.of.Infant, family = "binomial", 
    data = cleaned_data)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)   
(Intercept)       -0.07395    0.02655  -2.785  0.00535 **
Sex.of.InfantMale  0.02604    0.03708   0.702  0.48249   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16144  on 11652  degrees of freedom
Residual deviance: 16143  on 11651  degrees of freedom
AIC: 16147

Number of Fisher Scoring iterations: 3

I want to visualize the number of infants born term and preterm stratified by sex.

library(ggplot2)

# Summarize the data by Sex and Preterm status
summary_data <- cleaned_data %>%
  group_by(Sex.of.Infant, Preterm_Binary) %>%
  summarize(Count = n(), .groups = "drop")

# Map Preterm_Binary to labels for the plot
summary_data$Preterm_Label <- ifelse(summary_data$Preterm_Binary == 1, "Preterm", "Term")

# Create the bar plot with Gestational Age on the x-axis
ggplot(summary_data, aes(x = Preterm_Label, y = Count, fill = Sex.of.Infant)) +
  geom_bar(stat = "identity", position = "dodge") +  # Dodge for side-by-side bars
  labs(
    title = "Number of Infants Born Term and Preterm Stratified by Sex",
    x = "Gestational Age",
    y = "Number of Infants",
    fill = "Sex of Infant"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Angle x-axis labels for better readability
    legend.position = "top"
  )

Are males more likely to be born extremely preterm (<28 wks)?

# Create a binary variable for extremely preterm (<28 weeks)
cleaned_data$Extremely_Preterm <- ifelse(cleaned_data$OE.Gestational.Age.Recode.10 == "Under 20 weeks" |
                          cleaned_data$OE.Gestational.Age.Recode.10 == "20 - 27 weeks", 1, 0)

# Fit a logistic regression model
model <- glm(Extremely_Preterm ~ Sex.of.Infant, data = cleaned_data, family = binomial)

# Summary of the model
summary(model)

Call:
glm(formula = Extremely_Preterm ~ Sex.of.Infant, family = binomial, 
    data = cleaned_data)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -2.96672    0.06150 -48.242   <2e-16 ***
Sex.of.InfantMale  0.07988    0.08446   0.946    0.344    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4686.4  on 11643  degrees of freedom
Residual deviance: 4685.5  on 11642  degrees of freedom
  (9 observations deleted due to missingness)
AIC: 4689.5

Number of Fisher Scoring iterations: 5

1.5 Discussion

Describe your results and include relevant tables, plots, and code/comments used to obtain them. You may refer to the Section 1.3 as needed. End with a brief conclusion of your findings related to the question you set out to address. You can include references if you’d like, but this is not required.

1.6 Conclusion

This the conclusion. The Section 1.4 can be invoked here.

1.7 References

Bell, E.F., et al., Mortality, in-hospital morbidity, care practices, and 2-year outcomes for extremely preterm infants in the US, 2013-2018. JAMA, 2022. 327(3): p. 248-263.

Callaghan WM, MacDorman MF, Rasmussen SA, Qin C, Lackritz EM. The contribution of preterm birth to infant mortality rates in the United States. Pediatrics. 2006;118(4):1566-1573.

Thebaud, B., et al., Bronchopulmonary dysplasia. Nat Rev Dis Primers, 2019. 5(1): p. 78.

van Westering-Kroon, E., et al., Male Disadvantage in Oxidative Stress-Associated Complications of Prematurity: A Systematic Review, Meta-Analysis and Meta-Regression. Antioxidants (Basel), 2021. 10(9).

Inkster AM, Fernández-Boyano I, Robinson WP. Sex differences are here to stay: relevance to prenatal care. Journal of Clinical Medicine. 2021 Jul 5;10(13):3000.

NOT-OD-15-102: Consideration of Sex as a Biological Variable in NIH-Funded Research. https://grants.nih.gov/grants/guide/notice-files/NOT-OD-15-102.html. Accessed 10 Dec. 2024.

Albert, R., Lee, A., Lingappan, K. Response to Therapeutic Interventions in the NICU: Role of Sex as a Biological Variable. 2023. Neoreviews, 24(12), e797-e805.

=======
pretetrm_birth_rates_map

#How many counties are represented in this analysis?
num_unique_geoid_counties_preterm <- length(unique(county_map_preterm_data$GEOID))
print(num_unique_geoid_counties_preterm)
[1] 496
#Which counties have the highest and lowest preterm birth rates?
# Find the county with the highest preterm Birth Rate and its value (show only the first entry)
highest_preterm_birth_rate <- county_map_preterm_data %>%
  filter(Preterm.Birth.Rate == max(Preterm.Birth.Rate, na.rm = TRUE)) %>%
  slice(1)  # Select the first row

cat("County with the highest preterm Birth Rate: ", highest_preterm_birth_rate$County.of.Residence, "\n")
County with the highest preterm Birth Rate:  Kanawha County, WV 
cat("Highest preterm Birth Rate: ", highest_preterm_birth_rate$Preterm.Birth.Rate, "\n")
Highest preterm Birth Rate:  1.178862 
# Find the county with the lowest preterm Birth Rate and its value (show only the first entry)
lowest_preterm_birth_rate <- county_map_preterm_data %>%
  filter(Preterm.Birth.Rate == min(Preterm.Birth.Rate, na.rm = TRUE)) %>%
  slice(1)  # Select the first row

cat("County with the lowest preterm Birth Rate: ", lowest_preterm_birth_rate$County.of.Residence, "\n")
County with the lowest preterm Birth Rate:  Ocean County, NJ 
cat("Lowest preterm Birth Rate: ", lowest_preterm_birth_rate$Preterm.Birth.Rate, "\n")
Lowest preterm Birth Rate:  0.2631598 

Describe your results and include relevant tables, plots, and code/comments used to obtain them. You may refer to the Section 1.3 as needed. End with a brief conclusion of your findings related to the question you set out to address. You can include references if you’d like, but this is not required.

1.5 Conclusion

This the conclusion. The Section 1.4 can be invoked here.

>>>>>>> f88472e3a542e8cb88557f16a8d613ce8a791472